Inverse-Category-Frequency based Supervised Term Weighting Schemes for Text Categorization

نویسندگان

  • Deqing Wang
  • Hui Zhang
چکیده

Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifier and SVMs. The widely used term weighting scheme in text categorization, i.e., tf.idf, is originated from information retrieval (IR) field. The intuition behind idf for text categorization seems less reasonable than IR. In this paper, we introduce inverse category frequency (icf) into term weighting scheme and propose two novel approaches, i.e., tf.icf and icf-based supervised term weighting schemes. The tf.icf adopts icf to substitute idf factor and favors terms occurring in fewer categories, rather than fewer documents. And the icf-based approach combines icf and relevance frequency (rf) to weight terms in a supervised way. Our cross-classifier and cross-corpus experiments have shown that our proposed approaches are superior or comparable to six supervised term weighting schemes and three traditional schemes in terms of macro-F1 and micro-F1.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Inverse Category Frequency based supervised term weighting scheme for text categorization

Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifier and SVMs. The widely used term weighting scheme in text categorization, i.e., tf.idf, is originated from information retrieval (IR) field. The intuition behind idf for text categorization seems less reasonable than IR. In this paper, we introduce inverse category frequency (icf) int...

متن کامل

A Novel Term Weighting Scheme Midf for Text Categorization

Text categorization is a task of automatically assigning documents to a set of predefined categories. Usually it involves a document representation method and term weighting scheme. This paper proposes a new term weighting scheme called Modified Inverse Document Frequency (MIDF) to improve the performance of text categorization. The document represented in MIDF is trained using the support vect...

متن کامل

Probabilistic Supervised Term Weighting for Binary Text Categorization

In text categorization, the class agnostic (unsupervised) tf× idf term weighting scheme has seen widespread usage. Recently proposed supervised term weighting methods including tf×rf and tf× δidf make use of term class distribution to improve the classification accuracy. However, they only account for the presence of terms in classes, ignoring the absence of key categorical terms, which may giv...

متن کامل

Does a New Simple Gaussian Weighting Approach Perform Well in Text Categorization?

A new approach to the Text Categorization problem is here presented. It is called Gaussian Weighting and it is a supervised learning algorithm that, during the training phase, estimates two very simple and easily computable statistics which are: the Presence P, how much a term / is present in a category c\ the Expressiveness E, how much / is present outside c in the rest of the domain. Once the...

متن کامل

An Integrated and Improved Approach to Terms Weighting in Text Classification

Traditional text classification methods utilize term frequency (tf) and inverse document frequency (idf) as the main method for information retrieval. Term weighting has been applied to achieve high performance in text classification. Although TFIDF is a popular method, it is not using class information. This paper provides an improved approach for supervised weighting in the TFIDF model. The t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • J. Inf. Sci. Eng.

دوره 29  شماره 

صفحات  -

تاریخ انتشار 2013